DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
Recall that one assumption for inference on the population mean, \(\mu\), was that the distribution of the data was unimodal, as this indicated that a single measure of centre was appropriate for the data
If this assumption was not met, often, the most logical explanation was that a categorical variable was not measured
Hence, we often collect more than the variable of interest when carrying out observational studies and conducting (randomised) experiments
Let’s start with the scenario where we measured the same numeric variable for two independent groups
Suppose both population means, \(\mu_1 ~ \& ~ \mu_2\), and both population standard deviations, \(\sigma_1 ~ \& ~ \sigma_2\), are known; these are the ground “truths” (parameters) that summarise all possible values we could observe
Then the sampling distribution of the difference between sample means, \(\bar{x}_1 - \bar{x}_2\), is
\[ \bar{x}_1 - \bar{x}_2 ~ \text{approx.} ~ \text{Normal} \! \left(\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2, \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{(\sigma_1)^2}{n_1} + \frac{(\sigma_2)^2}{n_2}} \right) \]
The use of the \(\bar{x}_1 - \bar{x}_2\) subscripts is to make it clear that we are talking about the sampling distribution of \(\bar{x}_1 - \bar{x}_2\) and not the possible values we could observe
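A small simulation can make the sampling distribution above more concrete. The sketch below assumes two Normal populations with made-up parameters (the values of `mu1`, `sigma1`, `n1`, `mu2`, `sigma2` and `n2` are hypothetical and not from the course data) and compares the simulated centre and spread of \(\bar{x}_1 - \bar{x}_2\) with the formula.

```r
# Hypothetical population parameters and sample sizes (assumptions for illustration)
mu1 <- 50; sigma1 <- 10; n1 <- 40
mu2 <- 45; sigma2 <- 15; n2 <- 60

set.seed(121)

# Simulate 10,000 differences between two independent sample means
diffs <- replicate(10000,
                   mean(rnorm(n1, mu1, sigma1)) - mean(rnorm(n2, mu2, sigma2)))

# Compare the simulated mean and standard deviation with the formula
c(simulated = mean(diffs), theory = mu1 - mu2)
c(simulated = sd(diffs), theory = sqrt(sigma1^2 / n1 + sigma2^2 / n2))
```

With these assumed values the formula gives a mean of 5 and a standard deviation of 2.5, and the simulated values should land close to both.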
The standard error of the difference between sample means, \(\bar{x}_1 - \bar{x}_2\), is
\[ \text{se}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2}} \]
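where:
\(\phantom{\bullet} s_1 ~ \& ~ s_2\) are the sample standard deviations of the two groups
\(\phantom{\bullet} n_1 ~ \& ~ n_2\) are the sample sizes of the two groups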
Do Aucklanders, on average, have the same gross weekly income as Wellingtonians in the June quarter of 2011?
Are all four assumptions met?
# Reading in the data, then subset it to choose only Aucklanders
# and Wellingtonians
nzis.subset <- read.csv("datasets/NZIS-CART-SURF-2011.csv") |>
  subset(region == "Auckland" | region == "Wellington")
# histogram() comes from the lattice package
library(lattice)
histogram( ~ income | region, data = nzis.subset, nint = 50,
          type = "count", xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011")
Do Aucklanders, on average, have the same gross weekly income as Wellingtonians in the June quarter of 2011?
Difference between sample means, squared standard error and standard error: -9.3062, 290.7690458, 17.0519514
# Use split() to report the sample means, sample standard
# deviations and the number of observations for each group
split(nzis.subset, ~ region) |>
  lapply(\(x) mean(x$income))
$Auckland
[1] 720.1581
$Wellington
[1] 729.4643
split(nzis.subset, ~ region) |>
  lapply(\(x) sd(x$income))
$Auckland
[1] 885.168
$Wellington
[1] 840.5936
split(nzis.subset, ~ region) |>
  lapply(\(x) nrow(x))
$Auckland
[1] 9059
$Wellington
[1] 3459
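The difference between sample means and its standard error can be recovered from these summary statistics alone. The sketch below retypes the rounded values reported above (the object names `xbar`, `s`, `n`, `diff.means` and `se.diff` are chosen here for illustration), so the results may differ slightly in the last decimal places from a calculation on the raw data.

```r
# Summary statistics retyped from the reported output (rounded values)
xbar <- c(Auckland = 720.1581, Wellington = 729.4643)
s    <- c(Auckland = 885.168,  Wellington = 840.5936)
n    <- c(Auckland = 9059,     Wellington = 3459)

# Difference between sample means, Auckland minus Wellington
diff.means <- unname(xbar["Auckland"] - xbar["Wellington"])

# Standard error of the difference: sqrt(s1^2 / n1 + s2^2 / n2)
se.diff <- sqrt(sum(s^2 / n))

diff.means
se.diff
```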
\[ \bar{x}_1 - \bar{x}_2 \pm t^*_{1-\alpha/2}(\nu) \times \text{se}(\bar{x}_1 - \bar{x}_2) \]
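where:
\(\phantom{\bullet} t^*_{1-\alpha/2}(\nu)\) is the t-multiplier, the \(1-\alpha/2\) quantile of the Student’s t-distribution with \(\nu\) degrees of freedom
\(\phantom{\bullet} \text{se}(\bar{x}_1 - \bar{x}_2)\) is the standard error of the difference between sample means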
Do Aucklanders, on average, have the same gross weekly income as Wellingtonians in the June quarter of 2011?
Construct and interpret a 95% confidence interval for the difference between population average gross weekly incomes of Aucklanders and Wellingtonians to answer this question
From Slide 10:
\(\phantom{\bullet} \bar{x}_{A} - \bar{x}_B = -9.3062\)
\(\phantom{\bullet} \text{se}(\bar{x}_{A} - \bar{x}_B) = 17.05 ~ (2 ~ \text{dp})\)
The appropriate t-multiplier is:
\(\phantom{\bullet} t^*_{0.975}(6557.4) = 1.96\)
The 95% CI for \(\mu_A - \mu_B \approx\)
\(\phantom{\bullet} (-42.7242, 24.1118)\)
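The same interval can be sketched in R with qt(). The values of `diff.means`, `se.diff` and `nu` below are retyped from the slides (the object names are chosen here for illustration). Because the code does not round the t-multiplier or the standard error, its limits can differ from the hand calculation in the second decimal place.

```r
# Values retyped from the slides
diff.means <- -9.3062      # difference between sample means
se.diff    <- 17.0519514   # standard error of the difference
nu         <- 6557.4       # degrees of freedom

# t-multiplier for a 95% confidence interval
t.mult <- qt(0.975, df = nu)

# Estimate plus and minus t-multiplier times standard error
ci <- diff.means + c(-1, 1) * t.mult * se.diff

t.mult
ci
```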
Also known as the two-sample t-test (for \(\mu_1 - \mu_2\))
The material for one numeric variable and two groups (one categorical variable) presents it as \(\mu_1 - \mu_2\). Why…?
Let’s consider an abstract set of null and alternative hypothesis statements
\(\phantom{\bullet} H_0 \! : \mu_1 = \mu_2\)
\(\phantom{\bullet} H_1 \! : \mu_1 \neq \mu_2\)
As written, it is not obvious how we would specify a hypothesised difference between the two means, \(\mu_1 ~ \& ~ \mu_2\)
Let’s consider the following abstract set of null and alternative hypothesis statements
\(\phantom{\bullet} H_0 \! : \mu_1 - \mu_2 = 0\)
\(\phantom{\bullet} H_1 \! : \mu_1 - \mu_2 \neq 0\)
We can now specify a hypothesised difference between the two means, \(\mu_1 ~ \& ~ \mu_2\)
\[ t_0 = \frac{(\bar{x}_1 - \bar{x}_2)- \text{Diff}_0}{\text{se}(\bar{x}_1 - \bar{x}_2)} \]
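where:
\(\phantom{\bullet} \text{Diff}_0\) is the hypothesised difference between the two population means, \(\mu_1 - \mu_2\), under the null hypothesis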
Let \(T\) follow the Student’s t-distribution with \(\nu = \ldots\) degrees of freedom, see Slide 28
If it is a two-sided test, e.g. \(H_1 \! : \mu_1 - \mu_2 \neq \text{Diff}_0\)
\(\quad p\text{-value} = 2 \times \mathbb{P}(T > |t_0|)\)
If it is a one-sided test and \(H_1 \! : \mu_1 - \mu_2 > \text{Diff}_0\)
\(\quad p\text{-value} = \mathbb{P}(T > t_0)\)
If it is a one-sided test and \(H_1 \! : \mu_1 - \mu_2 < \text{Diff}_0\)
\(\quad p\text{-value} = \mathbb{P}(T < t_0)\)
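The three p-value rules map onto pt(), the cumulative distribution function of the Student's t-distribution in R. The helper `t.pvalue()` below is a hypothetical function written for illustration, not part of base R or the course material.

```r
# Hypothetical helper: p-value for a test statistic t0 with nu degrees of freedom
t.pvalue <- function(t0, nu, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         # H1: mu1 - mu2 != Diff0, so 2 * P(T > |t0|)
         two.sided = 2 * pt(abs(t0), df = nu, lower.tail = FALSE),
         # H1: mu1 - mu2 > Diff0, so P(T > t0)
         greater   = pt(t0, df = nu, lower.tail = FALSE),
         # H1: mu1 - mu2 < Diff0, so P(T < t0)
         less      = pt(t0, df = nu))
}

# Example call with made-up inputs
t.pvalue(2, nu = 10, alternative = "two.sided")
```

Using lower.tail = FALSE rather than 1 - pt(...) is a small design choice that avoids losing precision when the p-value is very small.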
Do Aucklanders, on average, have the same gross weekly income as Wellingtonians in the June quarter of 2011?
Conduct and interpret a hypothesis test at the 5% significance level to answer this question
From Slide 10:
\(\phantom{\bullet} \bar{x}_{A} - \bar{x}_B = -9.3062\)
\(\phantom{\bullet} \text{se}(\bar{x}_{A} - \bar{x}_B) = 17.05 ~ (2 ~ \text{dp})\)
Hypothesis statements:
\(\phantom{\bullet} H_0\!: \mu_A - \mu_B = 0\)
\(\phantom{\bullet} H_1\!: \mu_A - \mu_B \neq 0\)
The test statistic is:
\(\phantom{\bullet} t_0 \approx -0.55 ~ (2 ~ \text{dp})\)
The appropriate t-multiplier is:
\(\phantom{\bullet} t^*_{0.975}(6557.4) = 1.96\)
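One way to finish this test in R is sketched below, using the values quoted on the slides (the object names are chosen here for illustration). It computes the test statistic, the two-sided p-value, and whether \(|t_0|\) exceeds the t-multiplier at the 5% significance level.

```r
# Values retyped from the slides
diff.means <- -9.3062      # difference between sample means
se.diff    <- 17.0519514   # standard error of the difference
nu         <- 6557.4       # degrees of freedom
diff0      <- 0            # hypothesised difference under H0

# Test statistic
t0 <- (diff.means - diff0) / se.diff

# Two-sided p-value: 2 * P(T > |t0|)
p.value <- 2 * pt(abs(t0), df = nu, lower.tail = FALSE)

# TRUE would mean |t0| exceeds the t-multiplier, i.e. reject H0 at the 5% level
reject <- abs(t0) > qt(0.975, df = nu)

t0
p.value
reject
```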
A group of researchers in 1982 noted that thiol concentrations within human blood cells are seldom determined in clinical studies, in spite of the fact that they are believed to play a key role in many vital processes. They reported a new reliable method for measuring thiol concentration (in mmol) and demonstrated that, in one disease at least (rheumatoid arthritis), the change in thiol status in the lysate from packed blood cells is substantial.
There were two groups of volunteers, the first group sampled from a population with “normal” thiol concentrations and the second group sampled from those who have rheumatoid arthritis.
| Variable | Description |
|---|---|
| concent | A number denoting the thiol concentration (in mmol) |
| type | A factor denoting the population the observation belonged to, Normal or Rheumatoid |
Are all four assumptions met?
Construct and interpret a 99% confidence interval for the difference in the average thiol concentrations between the “normal” and rheumatoid arthritis populations.
You may use the fact that the degrees of freedom for the t-multiplier, \(\nu\), is approximately \(5.2528\)
With 99% confidence, we estimate that the true mean thiol concentration for the rheumatoid population exceeds that of the normal population by somewhere between 0.83 and 2.26 mmol
# Calculate & save the necessary descriptive statistics
xbars <- split(thiol.df, ~ type) |>
  lapply(\(x) mean(x$concent))
sds <- split(thiol.df, ~ type) |>
  lapply(\(x) sd(x$concent))
ns <- split(thiol.df, ~ type) |>
  lapply(\(x) nrow(x))
# Assign the t-multiplier
t.mult <- qt(0.995, df = 5.2528)
t.mult
[1] 3.934093
Conduct a hypothesis test at the 1% significance level to detect if the average thiol concentration of the rheumatoid arthritis population exceeds that of the “normal” population.
You may use the fact that the Student’s t-distribution’s \(\nu\) parameter is approximately \(5.2528\)
\(H_0\!: \mu_R - \mu_N = 0\)
\(H_1\!: \mu_R - \mu_N > 0\)
We have very strong evidence against the null hypothesis that the average thiol concentrations of the rheumatoid arthritis and “normal” populations are equal, in favour of the alternative that the average thiol concentration of the rheumatoid arthritis population is greater than that of the “normal” population (p-value = 0.0001)
A study randomly assigned students to take notes either longhand or using a laptop. The researchers had the students take a test after they wrote their notes. Do the data provide evidence of a difference in test scores between taking notes longhand and taking notes on a laptop?
| Variable | Description |
|---|---|
| score | An integer denoting the test score (unitless) |
| method | A factor denoting the note taking method, longhand or laptop |
Use the t.test() function to construct a 95% confidence interval for the difference in the mean test score between the longhand and laptop note taking methods.
# Let's use the t.test() function for a two-sample t-test
t.test(score ~ method, notes.df, conf.level = 0.95)
Welch Two Sample t-test
data: score by method
t = -2.9773, df = 74.31, p-value = 0.003924
alternative hypothesis: true difference in means between group laptop and group longhand is not equal to 0
95 percent confidence interval:
-10.592821 -2.099284
sample estimates:
mean in group laptop mean in group longhand
19.07500 25.42105
R, by default, typically organises the levels of categorical variables in alphabetical order. To manually change the order, we need to make use of the factor() function
# Create a new variable to rotate the group order
notes.df$method.new <- factor(notes.df$method, levels = c("longhand", "laptop"))
# Let's use the t.test() function for a two-sample t-test
t.test(score ~ method.new, notes.df, conf.level = 0.95)
Welch Two Sample t-test
data: score by method.new
t = 2.9773, df = 74.31, p-value = 0.003924
alternative hypothesis: true difference in means between group longhand and group laptop is not equal to 0
95 percent confidence interval:
2.099284 10.592821
sample estimates:
mean in group longhand mean in group laptop
25.42105 19.07500
The Student’s t-distribution is an approximation for the sampling distribution of all possible test statistics, \(t_0\), for the two-sample t-test taught in DATAX121 (Wild & Seber, 2000)
Interestingly, the Student’s t-distribution is a very good approximation if the degrees of freedom parameter, \(\nu\), is set to the following:
\[ \nu = \frac{\left\{ \frac{(s_1)^2}{n_1} + \frac{(s_2)^2}{n_2} \right\}^2}{\frac{1}{n_1 - 1} \left\{\frac{(s_1)^2}{n_1}\right\}^2 + \frac{1}{n_2 - 1} \left\{\frac{(s_2)^2}{n_2}\right\}^2} \]
This equation for \(\nu\) is commonly known as Satterthwaite’s approximation
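The formula for \(\nu\) translates directly into a few lines of R. The function name `welch.df()` is hypothetical and chosen here for illustration. The example call uses the rounded Auckland and Wellington summary statistics from earlier in these slides, so the result should be close to the \(\nu = 6557.4\) used for the t-multiplier.

```r
# Hypothetical helper: degrees of freedom from two sample standard
# deviations (s1, s2) and two sample sizes (n1, n2)
welch.df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1   # estimated variance of the first sample mean
  v2 <- s2^2 / n2   # estimated variance of the second sample mean
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Auckland (group 1) and Wellington (group 2) summary statistics
nu <- welch.df(s1 = 885.168, n1 = 9059, s2 = 840.5936, n2 = 3459)
nu
```

When the two groups have the same standard deviation and the same sample size \(n\), the formula reduces to \(2(n - 1)\), which makes a handy check on the function.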